Abstract

Rewards typically express desirabilities or preferences 
over a set of alternatives. Here we propose that rewards 
can be defined for any probability distribution based on
three desiderata, namely that rewards should be real- 
valued, additive and order-preserving, where the lat- 
ter implies that more probable events should also be 
more desirable. Our main result states that rewards 
are then uniquely determined by the negative infor- 
mation content. To analyze stochastic processes, we 
define the utility of a realization as its reward rate. 
Under this interpretation, we show that the expected 
utility of a stochastic process is its negative entropy 
rate. Furthermore, we apply our results to analyze 
agent-environment interactions. We show that the ex- 
pected utility that will actually be achieved by the 
agent is given by the negative cross-entropy from the 
input-output (I/O) distribution of the coupled inter- 
action system and the agent's I/O distribution. Thus, 
our results allow for an information-theoretic interpre- 
tation of the notion of utility and the characterization 
of agent-environment interactions in terms of entropy 
dynamics. 
Introduction

Purposeful behavior typically occurs when an agent ex- 
hibits specific preferences over different states of the 
environment. Mathematically these preferences can be 
formalized by the concept of a utility function that as- 
signs a numerical value to each possible state such that 
states with higher utility correspond to states that are
more desirable [Fishburn, 1982]. Behavior can then be
understood as the attempt to increase one's utility.
Accordingly, utility functions can be measured experimentally
by observing an agent choosing between different
options, as this way its preferences are revealed.
Mathematical models of rational agency that are based on
the notion of utility have been widely applied in
behavioral economics, biology and artificial intelligence
research [Russell and Norvig, 1995]. Typically, such
rational agent models assume a distinct reward signal (or
cost) that an agent is explicitly trying to optimize.

However, as an observer we might even attribute pur- 
posefulness to a system that does not have an explicit
reward signal, because the dynamics of the system it- 
self reveal a preference structure, namely the preference 
over all possible paths through history. Since in most 
systems not all of the histories are equally likely we 
might say that some histories are more probable than 
others because they are more desirable from the point 
of view of the system. Similarly, if we regard all possible 
interactions between a system and its environment, the 
behavior of the system can be conceived as a drive to 
generate desirable histories. This imposes a conceptual 
link between the probability of a history happening and 
the desirability of that history. In terms of agent de- 
sign, the intuitive rationale is that agents should act in a 
way such that more desired histories are more probable. 
The same holds of course for the environment. Conse- 
quently a competition arises between the agent and the 
environment, where both participants try to drive the 
dynamics of their interactions to their respective de- 
sired histories. In the following we want to show that 
this competition can be quantitatively assessed based 
on the entropy dynamics that govern the interactions 
between agent and environment. 

Preliminaries 

We introduce the following notation. A set is denoted
by a calligraphic letter like $\mathcal{X}$ and consists of elements
or symbols. Strings are finite concatenations of symbols
and sequences are infinite concatenations. The empty
string is denoted by $\epsilon$. $\mathcal{X}^n$ denotes the set of strings
of length $n$ based on $\mathcal{X}$, and
$\mathcal{X}^* = \bigcup_{n \ge 0} \mathcal{X}^n$ is the set of finite strings.
Furthermore,
$\mathcal{X}^\infty = \{x_1 x_2 \ldots \mid x_t \in \mathcal{X} \text{ for all } t = 1, 2, \ldots\}$
is defined as the set of one-way infinite sequences based
on $\mathcal{X}$. For substrings, the following shorthand notation
is used: a string that runs from index $i$ to $k$ is written
as $x_{i:k} = x_i x_{i+1} \ldots x_{k-1} x_k$. Similarly,
$x_{\le i} = x_1 x_2 \ldots x_i$ is a string starting from the
first index. By convention, $x_{i:j} = \epsilon$ if $i > j$. All proofs
can be found in the appendix.

Rewards 

In order to derive utility functions for stochastic
processes over finite alphabets, we construct a utility
function from an auxiliary function that measures the
desirability of events, i.e. such that we can assign
desirability values to every finite interval in a realization
of the process. We call this auxiliary function the
reward function. We impose three desiderata on a reward
function. First, we want rewards to be mappings from
events to real numbers that indicate the degree of
desirability of the events. Second, the reward of a joint
event should be obtained by summing up the rewards
of the sub-events. For example, the "reward of drinking
coffee and eating a croissant" should equal "the reward
of drinking coffee" plus the "reward of having a
croissant given the reward of drinking coffee"¹. This is
the additivity requirement of the reward function. The
last requirement that we impose for the reward function
should capture the intuition suggested in the introduction,
namely that more desirable events should also be
more probable events given the expectations of the system.
This is the consistency requirement.

We start out from a probability space $(\Omega, \mathcal{F}, \Pr)$,
where $\Omega$ is the sample space, $\mathcal{F}$ is a dense $\sigma$-algebra
and $\Pr$ is a probability measure over $\mathcal{F}$. In this section,
we use lowercase letters like $x, y, z$ to denote the
elements of the $\sigma$-algebra $\mathcal{F}$. Given a set $\mathcal{X}$, its
complement is denoted by $\mathcal{X}^c$ and its powerset by
$\mathcal{P}(\mathcal{X})$. The three desiderata can then be summarized
as follows:

Definition 1 (Reward). Let $S = (\Omega, \mathcal{F}, \Pr)$ be a
probability space. A function $r$ is a reward function for $S$
iff it has the following three properties:

1. Real-valued: for all $x, y \in \mathcal{F}$,

\[ r(x|y) \in \mathbb{R}; \]

2. Additivity: for all $x, y, z \in \mathcal{F}$,

\[ r(x, y|z) = r(x|z) + r(y|x, z); \]

3. Consistent: for all $x, y, u, v \in \mathcal{F}$,

\[ \Pr(x|u) > \Pr(y|v) \iff r(x|u) > r(y|v). \]

Furthermore, the unconditional reward is defined as
$r(x) = r(x|\Omega)$ for all $x \in \mathcal{F}$.

The following theorem shows that these three
desiderata enforce a strict mapping between rewards and
probabilities. The only function that can express such a
relationship is the logarithm.

Theorem 1. Let $S = (\Omega, \mathcal{F}, \Pr)$ be a probability space.
Then, a function $r$ is a reward function for $S$ iff for all
$x, y \in \mathcal{F}$

\[ r(x|y) = k \log \Pr(x|y), \]

where $k > 0$ is an arbitrary constant.

Notice that the constant $k$ in the expression
$r(x|y) = k \log \Pr(x|y)$ merely determines the units in which we
choose to measure rewards. Thus, the reward function
$r$ for a probability space $(\Omega, \mathcal{F}, \Pr)$ is essentially unique.
As a convention, we will assume natural logarithms and
set the constant to $k = 1$, i.e. $r(x|y) = \ln \Pr(x|y)$.

¹ Note that the additivity property does not imply that
the reward for two coffees is simply twice the reward for one
coffee, as the reward for the second coffee will be conditioned
on having had a first coffee already.

This result establishes a connection to information
theory. It is immediately clear that the reward of an
event is nothing more than its negative information
content: the quantity $h(x) = -r(x)$ is the Shannon
information content of $x \in \mathcal{F}$ measured in nats
[MacKay, 2003]. This means that we can interpret rewards as
"negative surprise values", and that "surprise values"
constitute losses.
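
As a minimal sketch of the relation $r(x|y) = \ln \Pr(x|y)$, the following Python snippet evaluates rewards on a small joint distribution and checks additivity numerically. The distribution, the event names and the helper functions `prob` and `reward` are illustrative assumptions and are not part of the original derivation.

```python
import math

# Hypothetical joint distribution over two binary events (illustrative numbers only).
# Keys are (coffee, croissant) outcomes; values are joint probabilities.
joint = {
    (True, True): 0.4,
    (True, False): 0.3,
    (False, True): 0.1,
    (False, False): 0.2,
}

def prob(event):
    """Probability of the set of outcomes for which event(outcome) is true."""
    return sum(p for outcome, p in joint.items() if event(outcome))

def reward(event, given=lambda outcome: True):
    """r(x|y) = ln Pr(x|y), i.e. k = 1 and rewards are measured in nats."""
    p_joint = prob(lambda o: event(o) and given(o))
    return math.log(p_joint / prob(given))

coffee = lambda o: o[0]
croissant = lambda o: o[1]
both = lambda o: o[0] and o[1]

# Additivity: r(coffee, croissant) = r(coffee) + r(croissant | coffee)
lhs = reward(both)
rhs = reward(coffee) + reward(croissant, given=coffee)

# Information content h(x) = -r(x), in nats
info_coffee = -reward(coffee)
```

Under these assumed numbers, `lhs` and `rhs` agree up to floating-point error, and the more probable event (coffee) receives the higher reward, as the consistency requirement demands.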

Proposition 1. Let $r$ be a reward function over a
probability space $(\Omega, \mathcal{F}, \Pr)$. Then, it has the following
properties:

i. Let $x, y \in \mathcal{F}$. Then

\[ -\infty = r(\emptyset) \le r(x|y) \le r(\Omega) = 0. \]

ii. Let $x \in \mathcal{F}$ be an event. Then,

\[ e^{r(x^c)} = 1 - e^{r(x)}. \]

iii. Let $z_1, z_2, \ldots \in \mathcal{F}$ be a sequence of disjoint events
with rewards $r(z_1), r(z_2), \ldots$ and let $x = \bigcup_i z_i$. Then

\[ e^{r(x)} = \sum_i e^{r(z_i)}. \]

The proof of this proposition is trivial and left to the 
reader. The first part sets the bounds for the values of 
rewards, and the two latter explain how to construct 
the rewards of events from known rewards using com- 
plement and countable union of disjoint events. 
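
The complement and disjoint-union rules of Proposition 1 can be checked numerically. The sketch below assumes a hypothetical six-element sample space with made-up probabilities; the function `r` simply implements $r(x) = \ln \Pr(x)$ with $k = 1$.

```python
import math

# Hypothetical distribution over a six-element sample space (illustrative only).
pr = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.25, 5: 0.1, 6: 0.05}

def r(event):
    """Unconditional reward r(x) = ln Pr(x); the empty event gets -infinity."""
    p = sum(pr[w] for w in event)
    return math.log(p) if p > 0 else -math.inf

omega = set(pr)
x = {1, 2, 3}
complement = omega - x

# (ii) complement rule: e^{r(x^c)} = 1 - e^{r(x)}
complement_lhs = math.exp(r(complement))
complement_rhs = 1.0 - math.exp(r(x))

# (iii) disjoint union: e^{r(x)} = sum_i e^{r(z_i)}
parts = [{1}, {2}, {3}]
union_lhs = math.exp(r(x))
union_rhs = sum(math.exp(r(z)) for z in parts)
```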

At first glance, the fact that rewards take on complicated
non-positive values might seem unnatural, as
in many applications one would like to use numerical 
values drawn from arbitrary real intervals. Fortunately, 
given numerical values representing the desirabilities of 
events, there is always an affine transformation that 
converts them into rewards. 

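Theorem 2. Let $\Omega$ be a countable set, and let
$d : \Omega \to (-\infty, a]$ be a mapping. Then, for every $\alpha > 0$, there is
a probability space $(\Omega, \mathcal{P}(\Omega), \Pr)$ with reward function
$r$ such that:

1. for all $\omega \in \Omega$,

\[ r(\{\omega\}) = \alpha d(\omega) + \beta \]

where $\beta = -\ln\left(\sum_{\omega' \in \Omega} e^{\alpha d(\omega')}\right)$;

2. and for all $\omega, \omega' \in \Omega$,

\[ d(\omega) > d(\omega') \iff r(\{\omega\}) > r(\{\omega'\}). \]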

Note that Theorem 2 implies that the probability
$\Pr(x)$ of any event $x$ in the $\sigma$-algebra $\mathcal{P}(\Omega)$ generated
by $\Omega$ is given by

\[ \Pr(x) = \frac{\sum_{\omega \in x} e^{\alpha d(\omega)}}{\sum_{\omega \in \Omega} e^{\alpha d(\omega)}}. \]

Note that for singletons $\{\omega\}$, $\Pr(\{\omega\})$ is the Gibbs measure
with negative energy $d(\omega)$ and temperature $\propto 1/\alpha$.
It is due to this analogy that we call the quantity $1/\alpha > 0$
the temperature parameter of the transformation.
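
The affine transformation of Theorem 2 can be sketched in a few lines of Python. The desirability values, the choice $\alpha = 0.5$ and the function name `rewards_from_desirabilities` are hypothetical and serve only to illustrate the construction.

```python
import math

def rewards_from_desirabilities(d, alpha=1.0):
    """Map desirabilities d(w) to rewards r({w}) = alpha * d(w) + beta.

    beta = -ln(sum_w exp(alpha * d(w))) normalizes the induced probabilities.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(alpha * v for v in d.values())
    beta = -(m + math.log(sum(math.exp(alpha * v - m) for v in d.values())))
    return {w: alpha * v + beta for w, v in d.items()}

# Hypothetical desirability values (any real numbers bounded above).
d = {"tea": 1.0, "coffee": 3.0, "water": -2.0}
r = rewards_from_desirabilities(d, alpha=0.5)
pr = {w: math.exp(rw) for w, rw in r.items()}  # induced Gibbs probabilities
```

The induced probabilities sum to one and preserve the ordering of the desirabilities; a smaller $\alpha$ (higher temperature) flattens the distribution, a larger $\alpha$ sharpens it.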



Utilities in Stochastic Processes 

In this section, we consider a stochastic process $\Pr$ over
sequences $x_1 x_2 x_3 \cdots$ in $\mathcal{X}^\infty$. We specify the process
by assigning conditional probabilities $\Pr(x_t | x_{<t})$ to all
finite strings $x_{<t} \in \mathcal{X}^*$. Note that the distribution
$\Pr(x_{\le t}) = \prod_{\tau=1}^{t} \Pr(x_\tau | x_{<\tau})$ for all
$x_{\le t} \in \mathcal{X}^*$ is normalized by construction. By the
Kolmogorov extension theorem, it is guaranteed that there exists
a unique probability space $S = (\mathcal{X}^\infty, \mathcal{F}, \Pr)$. We
therefore omit the reference to $S$ and talk about the process $\Pr$.

The reward function $r$ derived in the previous section
correctly expresses preference relations amongst different
outcomes. However, in the context of random sequences,
it has the downside that the reward of most
sequences diverges. A sequence $x_1 x_2 x_3 \cdots$ can be
interpreted as a progressive refinement of a point event in $\mathcal{F}$,
namely, the sequence of events
$\epsilon \supset x_{\le 1} \supset x_{\le 2} \supset x_{\le 3} \supset \cdots$.
One can exploit the interpretation of the index as
time to define a quantity that does not diverge. We
thus define the utility as the reward rate of a sequence.

Definition 2 (Utility). Let $r$ be a reward function for
the process $\Pr$. The utility of a string $x_{\le t} \in \mathcal{X}^*$ is
defined as

\[ U(x_{\le t}) = \frac{1}{t} \sum_{\tau=1}^{t} r(x_\tau | x_{<\tau}), \]

and for a sequence $x = x_1 x_2 x_3 \cdots \in \mathcal{X}^\infty$ it is defined
as

\[ U(x) = \lim_{t \to \infty} U(x_{\le t}) \]

if this limit exists².

A utility function that is constructed according to
Definition 2 has the following properties.

Proposition 2. Let $U$ be a utility function for a
process $\Pr$. The following properties hold:

i. For all $x = x_1 x_2 \cdots \in \mathcal{X}^\infty$ and all $t, k \in \mathbb{N}$,

\[ -\infty = U(\lambda) \le U(x_{\le t}) \le U(\epsilon) = 0, \]

where $\lambda$ is any impossible string/sequence.

ii. For all $x_{\le t} \in \mathcal{X}^*$,

\[ \Pr(x_{\le t}) = \exp\left(t \cdot U(x_{\le t})\right). \]

iii. For any $t \in \mathbb{N}$,

\[ E[U(x_{\le t})] = -\frac{1}{t} H[\Pr(x_{\le t})], \]

where $H$ is the entropy functional (see the appendix).

Part (i) provides trivial bounds on the utilities 
that directly carry over from the bounds on rewards. 
Part (ii) shows how the utility of a sequence determines 
its probability. Part (iii) implies that the expected util- 
ity of an interaction sequence is just its negative entropy 
rate. 
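
Parts (ii) and (iii) can be verified by brute force for a short horizon. The sketch below assumes a hypothetical i.i.d. process over {H, T} with bias 0.9 and enumerates all strings of length $t = 6$; the names `utility` and `prob` are illustrative.

```python
import itertools
import math

# Hypothetical i.i.d. process over the alphabet {"H", "T"} (illustrative bias).
p = {"H": 0.9, "T": 0.1}

def utility(string):
    """U(x_{<=t}) = (1/t) * sum_tau r(x_tau | x_{<tau}); here r = ln p[x_tau]."""
    return sum(math.log(p[s]) for s in string) / len(string)

def prob(string):
    """Probability of a string under the i.i.d. process."""
    return math.prod(p[s] for s in string)

t = 6
strings = list(itertools.product(p, repeat=t))

# Part (ii): Pr(x_{<=t}) = exp(t * U(x_{<=t})) for every string
max_gap = max(abs(prob(s) - math.exp(t * utility(s))) for s in strings)

# Part (iii): E[U(x_{<=t})] = -(1/t) H[Pr(x_{<=t})]
expected_utility = sum(prob(s) * utility(s) for s in strings)
entropy = -sum(prob(s) * math.log(prob(s)) for s in strings)
```

For an i.i.d. process the expected utility is simply the negative per-symbol entropy, so `expected_utility` does not depend on the horizon chosen here.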

² Strictly speaking, one could define the upper and
lower rate $U^+(x) = \limsup_{t \to \infty} U(x_{\le t})$ and
$U^-(x) = \liminf_{t \to \infty} U(x_{\le t})$ respectively, but we avoid
this distinction for simplicity.



Utility in Coupled I/O systems 

Let $\mathcal{O}$ and $\mathcal{A}$ be two finite sets, the first being the
set of observations and the second being the set of
actions. Using $\mathcal{A}$ and $\mathcal{O}$, a set of interaction
sequences is constructed. Define the set of interactions
as $\mathcal{Z} = \mathcal{A} \times \mathcal{O}$. A pair $(a, o) \in \mathcal{Z}$ is called an
interaction. We underline symbols to glue them together as in
$\underline{ao}_{\le t} = a_1 o_1 a_2 o_2 \cdots a_t o_t$.

An I/O system $\Pr$ is a probability distribution over
interaction sequences $\mathcal{Z}^\infty$. $\Pr$ is uniquely determined
by the conditional probabilities

\[ \Pr(a_t | \underline{ao}_{<t}), \quad \Pr(o_t | \underline{ao}_{<t} a_t) \]

for each $\underline{ao}_{\le t} \in \mathcal{Z}^*$. However, the semantics of the
probability distribution $\Pr$ are only fully defined once it
is coupled to another system. Note that an I/O system
is formally equivalent to a stochastic process; hence one
can construct a reward function $r$ for $\Pr$.

Let $P$, $Q$ be two I/O systems. An interaction system
$(P, Q)$ defines a generative distribution $G$ that describes
the probabilities that actually govern the I/O
stream once the two systems are coupled. $G$ is specified
by the equations

\[ G(a_t | \underline{ao}_{<t}) = P(a_t | \underline{ao}_{<t}) \]
\[ G(o_t | \underline{ao}_{<t} a_t) = Q(o_t | \underline{ao}_{<t} a_t) \]

valid for all $\underline{ao}_{\le t} \in \mathcal{Z}^*$. Here, $G$ is a stochastic
process over $\mathcal{Z}^\infty$ that models the true probability
distribution over interaction sequences that arises by coupling
two systems through their I/O streams. More
specifically, for the system $P$, $P(a_t | \underline{ao}_{<t})$ is the
probability of producing action $a_t \in \mathcal{A}$ given history
$\underline{ao}_{<t}$ and $P(o_t | \underline{ao}_{<t} a_t)$ is the predicted probability
of the observation $o_t \in \mathcal{O}$ given history $\underline{ao}_{<t} a_t$.
Hence, for $P$, the sequence $o_1 o_2 \ldots$ is its input stream and
the sequence $a_1 a_2 \ldots$ is its output stream. In contrast, the
roles of actions and observations are reversed in the case of the
system $Q$. This model of interaction is very general
in that it can accommodate many specific regimes of
interaction. By convention, we call the system $P$ the
agent and the system $Q$ the environment.

In the following we are interested in understanding 
the actual utilities that can be achieved by an agent P 
once coupled to a particular environment Q. Accord- 
ingly, we will compute expectations over functions of 
interaction sequences with respect to G, since the gen- 
erative distribution G describes the actual interaction 
statistics of the two coupled I/O systems. 

Theorem 3. Let $(P, Q)$ be an interaction system. The
expected rewards of $G$, $P$ and $Q$ for the first $t$
interactions are given by

\[
\begin{aligned}
E[r_G(\underline{ao}_{\le t})] &= -H[P(a_{\le t} | o_{<t})] - H[Q(o_{\le t} | a_{\le t})], \\
E[r_P(\underline{ao}_{\le t})] &= -H[P(a_{\le t} | o_{<t})] - H[Q(o_{\le t} | a_{\le t})]
  - KL[Q(o_{\le t} | a_{\le t}) \,\|\, P(o_{\le t} | a_{\le t})], \\
E[r_Q(\underline{ao}_{\le t})] &= -H[P(a_{\le t} | o_{<t})] - H[Q(o_{\le t} | a_{\le t})]
  - KL[P(a_{\le t} | o_{<t}) \,\|\, Q(a_{\le t} | o_{<t})],
\end{aligned}
\]

where $r_G$, $r_P$ and $r_Q$ are the reward functions for $G$,
$P$ and $Q$ respectively. Note that $H$ and $KL$ are the
entropy and the relative entropy functionals as defined
in the appendix.

Accordingly, the interaction system's expected reward
is given by the negative sum of the entropies produced
by the agent's action generation probabilities and
the environment's observation generation probabilities.
The agent's (actual) expected reward is given by the
negative cross-entropy between the generative distribution
$G$ and the agent's distribution $P$. The discrepancy
between the agent's and the interaction system's expected
reward is given by the relative entropy between
the two probability distributions. Since the relative
entropy is non-negative, one has
$E[r_G(\underline{ao}_{\le t})] \ge E[r_P(\underline{ao}_{\le t})]$.
This term implies that the better the environment is
"modeled" by the agent, the better its performance will
be. In other words: the agent has to recognize the
structure of the environment to be able to exploit it.
The designer can directly increase the agent's expected
performance by controlling the first and the last term.
The middle term is determined by the environment and
only indirectly controllable. Importantly, the terms are
in general coupled and not independent: changing one
might affect another. For example, the first term suggests
that less stochastic policies improve performance,
which is oftentimes the case. However, in the case of a
game with mixed Nash equilibria the overall reward can
increase for a stochastic policy, which means that the
first term is compensated for by the third term. Given
the expected rewards, we can easily calculate the
expected utilities in terms of entropy rates.

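Corollary 1. Let $(P, Q)$ be an interaction system. The
expected utilities of $G$, $P$ and $Q$ are given by

\[
\begin{aligned}
E[U_G] &= GU_P + GU_Q \\
E[U_P] &= GU_P + GU_Q + PU_P \\
E[U_Q] &= GU_P + GU_Q + PU_Q
\end{aligned}
\]

where $GU_P$, $GU_Q$, $PU_P$ and $PU_Q$ are entropy rates defined
as

\[
\begin{aligned}
GU_P &= -\frac{1}{t} \sum_{\tau=1}^{t} H[P(a_\tau | \underline{ao}_{<\tau})] \\
PU_P &= -\frac{1}{t} \sum_{\tau=1}^{t} KL[Q(o_\tau | \underline{ao}_{<\tau} a_\tau) \,\|\, P(o_\tau | \underline{ao}_{<\tau} a_\tau)] \\
GU_Q &= -\frac{1}{t} \sum_{\tau=1}^{t} H[Q(o_\tau | \underline{ao}_{<\tau} a_\tau)] \\
PU_Q &= -\frac{1}{t} \sum_{\tau=1}^{t} KL[P(a_\tau | \underline{ao}_{<\tau}) \,\|\, Q(a_\tau | \underline{ao}_{<\tau})].
\end{aligned}
\]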

This result is easily obtained by dividing the quantities
in Theorem 3 by $t$ and then applying the chain rule
for entropies to break the rewards over full sequences
into instantaneous rewards. Note that $GU_P$, $GU_Q$ are
the contributions to the utility due to the generation of
interactions, and $PU_P$, $PU_Q$ are the contributions to
the utility due to the prediction of interactions.
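
The decomposition of the expected rewards can be checked numerically for a single interaction ($t = 1$), where the histories are empty. The sketch below is only an illustration under assumed numbers: the tables `P_act`, `P_obs`, `Q_act` and `Q_obs` are hypothetical, and all entropy and relative-entropy terms are averaged under the generative distribution $G$.

```python
import math

# Hypothetical one-step interaction system (t = 1); all numbers are illustrative.
P_act = {"H": 0.7, "T": 0.3}                  # agent's action probabilities P(a)
P_obs = {"H": {"H": 0.6, "T": 0.4},           # agent's predictions P(o|a)
         "T": {"H": 0.5, "T": 0.5}}
Q_act = {"H": 0.5, "T": 0.5}                  # environment's predictions Q(a)
Q_obs = {"H": {"H": 0.9, "T": 0.1},           # environment's outputs Q(o|a)
         "T": {"H": 0.2, "T": 0.8}}

# Generative distribution G(a, o) = P(a) * Q(o|a)
G = {(a, o): P_act[a] * Q_obs[a][o] for a in P_act for o in Q_obs[a]}

def expect(f):
    """Expectation of f(a, o) under the generative distribution G."""
    return sum(g * f(a, o) for (a, o), g in G.items())

# Expected rewards computed directly as E_G[ln Pr(a, o)] for each system
E_rG = expect(lambda a, o: math.log(P_act[a] * Q_obs[a][o]))
E_rP = expect(lambda a, o: math.log(P_act[a] * P_obs[a][o]))
E_rQ = expect(lambda a, o: math.log(Q_act[a] * Q_obs[a][o]))

# Entropy and relative-entropy terms, all averaged under G
H_P = -expect(lambda a, o: math.log(P_act[a]))
H_Q = -expect(lambda a, o: math.log(Q_obs[a][o]))
KL_QP = expect(lambda a, o: math.log(Q_obs[a][o] / P_obs[a][o]))
KL_PQ = expect(lambda a, o: math.log(P_act[a] / Q_act[a]))
```

With $t = 1$ the expected utilities coincide with the expected rewards, so `-H_P`, `-H_Q`, `-KL_QP` and `-KL_PQ` play the roles of $GU_P$, $GU_Q$, $PU_P$ and $PU_Q$ respectively.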



Examples 

One of the most interesting aspects of the information- 
theoretic formulation of utility is that it can be applied 
both to control problems (where an agent acts in a non- 
adaptive environment) and to game theoretic problems 
(where two possibly adaptive agents interact). In the 
following we apply the proposed utility measures to two 
simple toy examples from these two areas. In the first 
example, an adaptive agent interacts with a biased coin 
(the non-adaptive agent) and tries to predict the next 
outcome of the coin toss, which is either 'Head' (H) or 
'Tail' (T). In the second example two adaptive agents 
interact playing the matching pennies game. One player 
has to match her action with the other player (HH or 
TT), while the other player has to unmatch (TH or 
HT). All agents have the same sets of possible observations
and actions which are the binary sets O = {H, T}
and A = {H, T}. 

Example 1. The non-adaptive agent is a biased coin.
Accordingly, the coin's action probability is given by
its bias and was set to $Q(o = H) = 0.9$. The coin
does not have any biased expectations about its observations,
so we set $Q(a = H) = 0.5$. The adaptive agent
is given by the Laplace agent whose expectations over
observed coin tosses follow the predictive distribution
$P(o = H | t, n) = (n + 1)/(t + 2)$, where $t$ is the number
of coin tosses observed so far and $n$ is the number of
observed Heads. Based on this estimator the Laplace
agent chooses its action deterministically according to
$P(a = H | t, n) = \Theta\left(\frac{n+1}{t+2} - \frac{1}{2}\right)$, where
$\Theta(\cdot)$ is the Heaviside step function. From these
distributions the full probability over interaction sequences
can be computed. Figure 1A shows the entropy dynamics for
a typical single run. The Laplace agent learns the distribution
of the coin tosses, i.e. the KL decreases to zero. The negative
cross-entropy stabilizes at the value of the observation
entropy that cannot be further reduced. The entropy
dynamics of the coin do not show any modulation.
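
A rough simulation of this example is sketched below. It is not the code used for Figure 1A: the function name `laplace_run`, the number of steps, the random seed and the convention $\Theta(0) = 0$ (ties resolved to Tail) are assumptions made for illustration. The trace records the instantaneous relative entropy between the coin's distribution and the agent's prediction.

```python
import math
import random

def laplace_run(bias=0.9, steps=2000, seed=0):
    """Track KL[Q(o) || P(o | t, n)] as a Laplace agent observes a biased coin."""
    rng = random.Random(seed)
    n = 0  # number of Heads observed so far
    kl_trace, actions = [], []
    for t in range(steps):
        p_head = (n + 1) / (t + 2)  # Laplace predictive probability of Heads
        # Deterministic action via the Heaviside step (assumed: ties go to Tail).
        actions.append("H" if p_head > 0.5 else "T")
        kl = (bias * math.log(bias / p_head)
              + (1 - bias) * math.log((1 - bias) / (1 - p_head)))
        kl_trace.append(kl)
        n += rng.random() < bias  # observe the next toss (True counts as 1)
    return kl_trace, actions

kl_trace, actions = laplace_run()
```

Plotting `kl_trace` against the step index gives a curve of the kind described above: it starts at the relative entropy between the coin and a uniform prediction and shrinks as the estimate approaches the bias.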

Example 2. The two agents are modeled based
on smooth fictitious play [Fudenberg and Kreps, 1993].
Both players keep count of the empirical frequencies of
Head and Tail respectively. Therefore, each player $i$
stores the quantities $\kappa_i^{(1)} = n_i$ and $\kappa_i^{(2)} = t - n_i$, where $t$
is the number of moves observed so far, $n_1$ is the number
of Heads observed by Player 1 and $n_2$ is the number of
Heads observed by Player 2. The probability distributions
$P(o = H | t, n_1) = \gamma_1$ and $Q(a = H | t, n_2) = \gamma_2$ over
inputs are given by these empirical frequencies through
$\gamma_i = \kappa_i^{(1)} / \sum_j \kappa_i^{(j)}$. The action probabilities are
computed according to a sigmoid best-response function
$P(a = H | t, n_1) = 1/(1 + \exp(-\alpha(\gamma_1 - 0.5)))$, and
$Q(o = H | t, n_2) = 1/(1 + \exp(-\alpha(0.5 - \gamma_2)))$
respectively in case of Player 2 that has to unmatch. This
game has a well-known equilibrium solution that is a 
mixed strategy Nash equilibrium where both players act
randomly. Both action and observation entropies converge
to the value log(2). Interestingly, the information-theoretic
utility as computed by the cross-entropy takes
the action entropy into account. Compare Figure 1B.

Figure 1: (A) Entropy dynamics of a Laplace agent interacting with a coin of bias 0.9. The Laplace agent learns
to predict the coin's behavior as can be seen in the decrease of the KL-divergence and the cross entropy. Since
the Laplace agent acts deterministically its action entropy is always zero. Its observation entropy equals the action
entropy of the coin. The coin does not change its behavior, which can be seen from the flat entropy curves. (B)
Entropy dynamics of two adaptive agents playing matching pennies. Both agents follow smooth fictitious play. They
converge to uniform random policies, which means that their action negentropies converge to log(2). Both agents
learn the probability distribution of the other agent, as can be seen in the decrease of the KL-divergences.
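
The matching pennies example can be sketched as follows. This is an illustrative simulation and not the code behind Figure 1B: the gain `alpha = 10`, the seed, the number of steps and the initial pseudo-count of one per symbol (needed so that the empirical frequencies are defined before any move is observed) are all assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_entropy(p):
    """Entropy (in nats) of a binary distribution with P(H) = p."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

def matching_pennies(steps=5000, alpha=10.0, seed=0):
    rng = random.Random(seed)
    # Assumption: one pseudo-count per symbol so frequencies are defined at t = 0.
    heads = [1, 1]  # heads[i]: Heads observed so far by Player i + 1
    total = 2
    trace = []
    for _ in range(steps):
        gamma1, gamma2 = heads[0] / total, heads[1] / total
        p_a = sigmoid(alpha * (gamma1 - 0.5))  # Player 1 tries to match
        q_o = sigmoid(alpha * (0.5 - gamma2))  # Player 2 tries to unmatch
        a_is_head = rng.random() < p_a
        o_is_head = rng.random() < q_o
        heads[0] += o_is_head  # Player 1 observes Player 2's move
        heads[1] += a_is_head  # Player 2 observes Player 1's move
        total += 1
        trace.append((p_a, q_o, binary_entropy(p_a), binary_entropy(q_o)))
    return trace

trace = matching_pennies()
```

Each entry of `trace` holds the two action probabilities and the corresponding action entropies, which are bounded above by log(2) and can be plotted to inspect how close the players come to the uniform mixed strategy.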

Conclusion 

Based on three simple desiderata we propose that rewards
can be measured in terms of information content
and that, consequently, the entropy satisfies properties
characteristic of a utility function. Previous theoretical
studies have reported structural similarities between
entropy and utility functions, see e.g. [Candeal, 2001], and
recently, relative entropy has even been proposed as a
measure of utility in control systems [Todorov, 2009,
Kappen et al., 2009, Ortega and Braun, 2008]. The
contribution of this paper is to derive axiomatically a
precise relation between rewards and information value
and to apply it to coupled I/O systems.

The utility functions that we have derived can be
conceptualized as path utilities, because they assign a
utility value to an entire history. This is very similar
to the path integral formulation in quantum mechanics
where the utility of a path is determined by the
classic action integral and the probability of a path is
also obtained by taking the exponential of this 'utility'
[Feynman and Hibbs, 1965]. In particular, we obtain
the (cumulative time-averaged) cross entropy as a utility
function when an agent is coupled to an environment.
This utility function not only takes into account
the KL-divergence as a measure of learning, but also
the action entropy. This is interesting, because in most
control problems controllers are designed to be deterministic
(e.g. optimal control theory) in response to
a known and stationary environment. If, however, the
environment is not stationary and in fact adaptive as
well, then it is a well-known result from game theory
that optimal strategies might be randomized. The utility
function that we are proposing might indeed allow
quantifying a trade-off between reducing the KL and
reducing the action entropy. In the future it will therefore
be interesting to investigate this utility function in
more complex interaction systems.

Appendix 
Entropy functionals 

Entropy: Let $\Pr$ be a probability distribution over
$\mathcal{X} \times \mathcal{Y}$. Define the (average conditional) entropy
[Shannon, 1948] as

\[ H[\Pr(x|y)] = -\sum_{x,y} \Pr(x, y) \ln \Pr(x|y). \]

Relative Entropy: Let $\Pr_1$ and $\Pr_2$ be two probability
distributions over $\mathcal{X} \times \mathcal{Y}$. Define the (average
conditional) relative entropy [Kullback and Leibler, 1951] as

\[ KL[\Pr_1(x|y) \,\|\, \Pr_2(x|y)] = \sum_{x,y} \Pr_1(x, y) \ln \frac{\Pr_1(x|y)}{\Pr_2(x|y)}. \]

Proof of Theorem 1

Proof. Let the function $g$ be such that $g(\Pr(x)) = r(x)$.
Let $x_1, x_2, \ldots, x_n \in \mathcal{F}$ be a sequence of events, such
that $\Pr(x_1) = \Pr(x_i | x_{<i}) > 0$ for all $i = 2, \ldots, n$.
We have $\Pr(x_1, \ldots, x_n) = \prod_i \Pr(x_i | x_{<i}) = \Pr(x_1)^n$.
Since $\Pr(x) > \Pr(x') \iff r(x) > r(x')$ for any $x, x' \in \mathcal{F}$,
then $\Pr(x) = \Pr(x') \iff r(x) = r(x')$, and thus
$\Pr(x_1) = \Pr(x_i | x_{<i}) \iff r(x_1) = r(x_i | x_{<i})$ for all
$i = 2, \ldots, n$. This means $r(x_1, \ldots, x_n) = n\, r(x_1)$.
But $g(\Pr(x_1, \ldots, x_n)) = r(x_1, \ldots, x_n)$, and hence
$g(\Pr(x_1)^n) = n\, r(x_1)$. Similarly, for a second sequence
of events $y_1, y_2, \ldots, y_m \in \mathcal{F}$ with
$\Pr(y_1) = \Pr(y_i | y_{<i}) > 0$ for all $i = 1, \ldots, m$, we have
$g(\Pr(y_1)^m) = m\, r(y_1)$.

The rest of the argument parallels Shannon's entropy
theorem [Shannon, 1948]. Define $p = \Pr(x_1)$ and $q = \Pr(y_1)$.
Choose $n$ arbitrarily high to satisfy $q^{m+1} < p^n \le q^m$
(the powers of $q$ decrease with the exponent because $q < 1$).
Taking the logarithm, and dividing by $n \log q$, which is
negative and therefore reverses the inequalities, one obtains

\[ \frac{m}{n} \le \frac{\log p}{\log q} < \frac{m}{n} + \frac{1}{n}
   \iff \left| \frac{m}{n} - \frac{\log p}{\log q} \right| < \varepsilon, \]

where $\varepsilon > 0$ is arbitrarily small. Similarly, using
$g(p^n) = n\, g(p)$ and the monotonicity of $g$, we can write
$(m + 1)\, g(q) < n\, g(p) \le m\, g(q)$ and thus

\[ \frac{m}{n} \le \frac{g(p)}{g(q)} < \frac{m}{n} + \frac{1}{n}
   \iff \left| \frac{m}{n} - \frac{g(p)}{g(q)} \right| < \varepsilon, \]

where $\varepsilon > 0$ is arbitrarily small. Combining these two
inequalities, one gets

\[ \left| \frac{\log p}{\log q} - \frac{g(p)}{g(q)} \right| < 2\varepsilon, \]

which, fixing $q$, gives $r(p) = g(p) = k \log p$, where $k > 0$.
This holds for any $x_1 \in \mathcal{F}$ with $\Pr(x_1) > 0$. □

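Proof of Theorem 2

Proof. For all $\omega, \omega' \in \Omega$,
$d(\omega) > d(\omega') \iff \alpha d(\omega) + \beta > \alpha d(\omega') + \beta
\iff r(\{\omega\}) > r(\{\omega'\})$ because the affine
transformation is positive. Now, the induced probability
over $\mathcal{P}(\Omega)$ has atoms $\{\omega\}$ with probabilities
$\Pr(\{\omega\}) = e^{r(\{\omega\})} \ge 0$ and is normalized:

\[ \sum_{\omega \in \Omega} e^{r(\{\omega\})}
   = \sum_{\omega \in \Omega} e^{\alpha d(\omega) + \beta}
   = \frac{\sum_{\omega \in \Omega} e^{\alpha d(\omega)}}{\sum_{\omega \in \Omega} e^{\alpha d(\omega)}}
   = 1. \]

Since knowing $\Pr(\{\omega\})$ for all $\omega \in \Omega$ determines the
measure for the whole field $\mathcal{P}(\Omega)$, $(\Omega, \mathcal{P}(\Omega), \Pr)$ is a
probability space. □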

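Proof of Proposition 2

Proof. (i) Since $-\infty \le r(x_\tau | x_{<\tau}) \le 0$ for all $\tau$, then
$-\infty \le \frac{1}{t} \sum_{\tau=1}^{t} r(x_\tau | x_{<\tau}) = U(x_{\le t}) \le 0$
for all $t$. (ii) Write $\Pr(x_{\le t})$ as

\[
\begin{aligned}
\Pr(x_{\le t}) &= \prod_{\tau=1}^{t} \Pr(x_\tau | x_{<\tau})
  = \prod_{\tau=1}^{t} \exp\left(r(x_\tau | x_{<\tau})\right) \\
 &= \exp\left(\sum_{\tau=1}^{t} r(x_\tau | x_{<\tau})\right)
  = \exp\left(t \cdot U(x_{\le t})\right).
\end{aligned}
\]

(iii) $E[U(x_{\le t})] = \sum_{x_{\le t}} \Pr(x_{\le t})\, U(x_{\le t})
= \sum_{x_{\le t}} \Pr(x_{\le t})\, \frac{1}{t}\, r(x_{\le t})
= -\frac{1}{t} H[\Pr(x_{\le t})]$, where we have
applied (ii) in the second equality and $r(\cdot) = \ln(\Pr(\cdot))$
in the third equality. □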

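Proof of Theorem 3

Proof. This proof is done by straightforward calculation.
First note that

\[
\begin{aligned}
G(\underline{ao}_{\le t}) &= \prod_{\tau=1}^{t} P(a_\tau | \underline{ao}_{<\tau})\, Q(o_\tau | \underline{ao}_{<\tau} a_\tau) \\
 &= P(a_{\le t} | o_{<t})\, Q(o_{\le t} | a_{\le t}),
\end{aligned}
\]

which is obtained by applying multiple times the chain
rule for probabilities and noting that the probability
of a symbol is fully determined by the previous symbols.
Similarly $P(\underline{ao}_{\le t}) = P(a_{\le t} | o_{<t})\, P(o_{\le t} | a_{\le t})$ is
obtained. We calculate here $E[r_P(\underline{ao}_{\le t})]$. The calculations
for $E[r_G(\underline{ao}_{\le t})]$ and $E[r_Q(\underline{ao}_{\le t})]$ are omitted because
they are analogous.

\[
\begin{aligned}
E[r_P(\underline{ao}_{\le t})]
 &\overset{(a)}{=} \sum_{\underline{ao}_{\le t}} G(\underline{ao}_{\le t}) \ln P(\underline{ao}_{\le t}) \\
 &\overset{(b)}{=} \sum_{\underline{ao}_{\le t}} G(\underline{ao}_{\le t}) \left( \ln P(a_{\le t} | o_{<t}) + \ln P(o_{\le t} | a_{\le t}) \right) \\
 &\overset{(c)}{=} \sum_{\underline{ao}_{\le t}} G(\underline{ao}_{\le t}) \left( \ln P(a_{\le t} | o_{<t}) + \ln P(o_{\le t} | a_{\le t})
   + \ln Q(o_{\le t} | a_{\le t}) - \ln Q(o_{\le t} | a_{\le t}) \right) \\
 &\overset{(d)}{=} -H[P(a_{\le t} | o_{<t})] - H[Q(o_{\le t} | a_{\le t})]
   - KL[Q(o_{\le t} | a_{\le t}) \,\|\, P(o_{\le t} | a_{\le t})].
\end{aligned}
\]

Equality (a) follows from the definition of expectations
and the relation between rewards and probabilities.
In (b) we separate the term in the logarithm into
the action and observation part. In (c) we add and
subtract the term $Q(o_{\le t} | a_{\le t})$ in the logarithm.
Equality (d) follows from the algebraic manipulation of the
terms and from identifying the entropy terms, noting
that $G(\underline{ao}_{\le t}) = P(a_{\le t} | o_{<t})\, Q(o_{\le t} | a_{\le t})$. □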